Protein - Structure Mapping, Alignments, and Visualization¶
This notebook gives an example of how to map a single protein sequence to its structure, along with conducting sequence alignments and visualizing the mutations.
Imports¶
In [ ]:
import sys
import logging
In [ ]:
# Import the Protein class
from ssbio.core.protein import Protein
In [ ]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging¶
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>)
to
specify how verbose you want the pipeline to be. Debug is most verbose.
CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don’t affect running of the pipeline
INFO
(default)- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff
DEBUG
mode prints out a large amount of information,
especially if you have a lot of genes. This may stall your notebook!
In [ ]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [ ]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project¶
Set these three things:
ROOT_DIR
- The directory where a folder named after your
PROTEIN_ID
will be created
- The directory where a folder named after your
PROTEIN_ID
- Your protein ID
PROTEIN_SEQ
- Your protein sequence
A directory will be created in ROOT_DIR
with your PROTEIN_ID
name. The folders are organized like so:
ROOT_DIR
└── PROTEIN_ID
├── sequences # Protein sequence files, alignments, etc.
└── structures # Protein structure files, calculations, etc.
In [ ]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
-
class
ssbio.core.protein.
Protein
(ident, description=None, root_dir=None, pdb_file_type=’mmtf’)[source] Store information about a protein, which represents the monomeric translated unit of a gene.
The main utilities of this class are to:
- Load, parse, and store the same (ie. from different database sources) or similar (ie. from different strains)
protein sequences as SeqProp objects in the
sequences
attribute - Load, parse, and store multiple experimental or predicted protein structures as StructProp
objects in the
structures
attribute - Set a single
representative_sequence
andrepresentative_structure
- Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
- Provide summaries of alignments and mutations seen
- Map between residue numbers of sequences and structures
Parameters: - ident (str) – Unique identifier for this protein
- description (str) – Optional description for this protein
- root_dir (str) – Path to where the folder named by this protein’s ID will be created. Default is current working directory.
- pdb_file_type (str) –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB
Todo
- Implement structural alignment objects with FATCAT
- Load, parse, and store the same (ie. from different database sources) or similar (ie. from different strains)
protein sequences as SeqProp objects in the
In [ ]:
# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR, pdb_file_type='mmtf')
In [ ]:
# Load the protein sequence
# This sets the loaded sequence as the representative one
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)
Mapping sequence –> structure¶
Since the sequence has been provided, we just need to BLAST it to the PDB.
Methods¶
-
Protein.
blast_representative_sequence_to_pdb
(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source] BLAST the representative protein sequence to the PDB. Saves a raw BLAST result file (XML file).
Parameters: - seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded XML files, must be set if protein directory was not initialized
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False.
Returns: List of new
PDBProp
objects added to thestructures
attributeReturn type: list
In [ ]:
# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()
Downloading and ranking structures¶
Methods¶
-
Protein.
pdb_downloader_and_metadata
(outdir=None, pdb_file_type=None, force_rerun=False)[source] Download ALL mapped experimental structures to the protein structures directory.
Parameters: - outdir (str) – Path to output directory, if protein structures directory not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
Returns: List of PDB IDs that were downloaded
Return type: list
Todo
- Parse mmtf or PDB file for header information, rather than always getting the cif file for header info
In [ ]:
# Download all mapped PDBs and gather the metadata
my_protein.pdb_downloader_and_metadata()
my_protein.df_pdb_metadata.head(2)
-
Protein.
set_representative_structure
(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine=’needle’, always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, skip_large_structures=False, force_rerun=False)[source] Set a representative structure from a structure in the structures attribute.
- Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the
always_use_homology
flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if Protein directory was not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if Protein directory was not created initially
- pdb_file_type (str) –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB - engine (str) –
biopython
orneedle
- which pairwise alignment program to use.needle
is the standard EMBOSS tool to run pairwise alignments.biopython
is Biopython’s implementation of needle. Results can differ! - always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- clean (bool) – If structure should be cleaned
- keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Returns: Representative structure from the list of structures. This is a not a map to the original structure, it is copied and optionally cleaned from the original one.
Return type:
In [ ]:
# Set representative structures
my_protein.set_representative_structure()
Loading and aligning new sequences¶
You can load additional sequences into this protein object and align them to the representative sequence.
Methods¶
-
Protein.
load_manual_sequence
(seq, ident=None, write_fasta_file=False, outdir=None, set_as_representative=False, force_rewrite=False)[source] Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.
Parameters: - seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object
- ident (str) – Optional identifier for the sequence, required if seq is a string. Also will override existing IDs in Seq or SeqRecord objects if set.
- write_fasta_file (bool) – If this sequence should be written out to a FASTA file
- outdir (str) – Path to output directory
- set_as_representative (bool) – If this sequence should be set as the representative one
- force_rewrite (bool) – If the FASTA file should be overwritten if it already exists
Returns: Sequence that was loaded into the
sequences
attributeReturn type:
In [ ]:
# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)
In [ ]:
# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)
-
Protein.
pairwise_align_sequences_to_representative
(gapopen=10, gapextend=0.5, outdir=None, engine=’needle’, parse=True, force_rerun=False)[source] Pairwise all sequences in the sequences attribute to the representative sequence. Stores the alignments in the
sequence_alignments
DictList attribute.Parameters: - gapopen (int) – Only for
engine='needle'
- Gap open penalty is the score taken away when a gap is created - gapextend (float) – Only for
engine='needle'
- Gap extension penalty is added to the standard gap penalty for each base or residue in the gap - outdir (str) – Only for
engine='needle'
- Path to output directory. Default is the protein sequence directory. - engine (str) –
biopython
orneedle
- which pairwise alignment program to use.needle
is the standard EMBOSS tool to run pairwise alignments.biopython
is Biopython’s implementation of needle. Results can differ! - parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
- force_rerun (bool) – Only for
engine='needle'
- Default False, set to True if you want to rerun the alignment if outfile exists.
- gapopen (int) – Only for
In [ ]:
# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()
In [ ]:
# View IDs of all sequence alignments
[x.id for x in my_protein.sequence_alignments]
# View the stored information for one of the alignments
my_alignment = my_protein.sequence_alignments.get_by_id('SRR1753782_00918_N17P_SNP')
my_alignment.annotations
str(my_alignment[0].seq)
str(my_alignment[1].seq)
-
Protein.
sequence_mutation_summary
(alignment_ids=None, alignment_type=None)[source] Summarize all mutations found in the sequence_alignments attribute.
Returns 2 dictionaries, single_counter and fingerprint_counter.
- single_counter:
Dictionary of
{point mutation: list of genes/strains}
Example:{ ('A', 24, 'V'): ['Strain1', 'Strain2', 'Strain4'], ('R', 33, 'T'): ['Strain2'] }
Here, we report which genes/strains have the single point mutation.
- fingerprint_counter:
Dictionary of
{mutation group: list of genes/strains}
Example:{ (('A', 24, 'V'), ('R', 33, 'T')): ['Strain2'], (('A', 24, 'V')): ['Strain1', 'Strain4'] }
Here, we report which genes/strains have the specific combinations (or “fingerprints”) of point mutations
Parameters: - alignment_ids (str, list) – Specified alignment ID or IDs to use
- alignment_type (str) – Specified alignment type contained in the
annotation
field of an alignment object,seqalign
orstructalign
are the current types.
Returns: single_counter, fingerprint_counter
Return type: dict, dict
In [ ]:
# Summarize all the mutations in all sequence alignments
s,f = my_protein.sequence_mutation_summary(alignment_type='seqalign')
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f
Some additional methods¶
Getting binding site/other information from UniProt¶
In [ ]:
import ssbio.databases.uniprot
In [ ]:
this_examples_uniprot = 'P14062'
sites = ssbio.databases.uniprot.uniprot_sites(this_examples_uniprot)
my_protein.representative_sequence.features = sites
my_protein.representative_sequence.features
Mapping sequence residue numbers to structure residue numbers¶
Methods¶
-
Protein.
map_seqprop_resnums_to_structprop_resnums
(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source] Map a residue number in any SeqProp to the structure’s residue number for a specified chain.
Parameters: - resnums (int, list) – Residue numbers in the sequence
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – Chain ID to map to
- use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns: Mapping of sequence residue numbers to structure residue numbers
Return type: dict
In [ ]:
# Returns a dictionary mapping sequence residue numbers to structure residue identifiers
# Will warn you if residues are not present in the structure
structure_sites = my_protein.map_seqprop_resnums_to_structprop_resnums(resnums=[1,3,45],
use_representatives=True)
structure_sites
Viewing structures¶
The awesome package nglview is
utilized as a backend for viewing structures within a Jupyter notebook.
ssbio
view functions will either return a NGLWidget
object,
which is the same as using nglview
like the below example, or act
upon the widget object itself.
# This is how NGLview usually works - it will load a structure file and return a NGLWidget "view" object.
import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view
Methods¶
-
StructProp.
view_structure
(only_chains=None, opacity=1.0, recolor=False, gui=False)[source] Use NGLviewer to display a structure in a Jupyter notebook
Parameters: - only_chains (str, list) – Chain ID or IDs to display
- opacity (float) – Opacity of the structure
- recolor (bool) – If structure should be cleaned and recolored to silver
- gui (bool) – If the NGLview GUI should show up
Returns: NGLviewer object
In [ ]:
# View just the structure
view = my_protein.representative_structure.view_structure()
view
-
Protein.
add_mutations_to_nglview
(view, alignment_type=’seqalign’, alignment_ids=None, seqprop=None, structprop=None, chain_id=None, use_representatives=False, grouped=False, color=’red’, unique_colors=True, opacity_range=(0.8, 1), scale_range=(1, 5))[source] Add representations to an NGLWidget view object for residues that are mutated in the
sequence_alignments
attribute.Parameters: - view (NGLWidget) – NGLWidget view object
- alignment_type (str) – Specified alignment type contained in the
annotation
field of an alignment object,seqalign
orstructalign
are the current types. - alignment_ids (str, list) – Specified alignment ID or IDs to use
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
- grouped (bool) – If groups of mutations should be colored and sized together
- color (str) – Color of the mutations (overridden if unique_colors=True)
- unique_colors (bool) – If each mutation/mutation group should be colored uniquely
- opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
- scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
In [ ]:
# Map the mutations on the visualization (scale increased) - will show up on the above view
my_protein.add_mutations_to_nglview(view=view, alignment_type='seqalign', scale_range=(4,7),
use_representatives=True)
-
Protein.
add_features_to_nglview
(view, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source] Add select features from the selected SeqProp object to an NGLWidget view object.
Currently parsing for:
- Single residue features (ie. metal binding sites)
- Disulfide bonds
Parameters: - view (NGLWidget) – NGLWidget view object
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
In [ ]:
# Add sites as shown above in the table to the view
my_protein.add_features_to_nglview(view=view, use_representatives=True)
Saving¶
-
Protein.
save_json
(outfile, compression=False) Save the object as a JSON file using json_tricks
In [ ]:
import os.path as op
my_protein.save_json(op.join(my_protein.protein_dir, '{}.json'.format(my_protein.id)))